17 research outputs found

    Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning

    Full text link
    An oft-cited challenge of federated learning is the presence of heterogeneity. \emph{Data heterogeneity} refers to the fact that data from different clients may follow very different distributions. \emph{System heterogeneity} refers to client devices having different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before starting federated training. Using four standard federated learning benchmark datasets, we empirically study the impact of starting from a pre-trained model in federated learning. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of more accurate models (up to 40\%) than is possible when starting from random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We recommend that future work proposing and evaluating federated optimization methods evaluate performance when starting from both random and pre-trained initializations. This study raises several questions for further work on understanding the role of heterogeneity in federated optimization. \footnote{Our code is available at: \url{https://github.com/facebookresearch/where_to_begin}} Comment: Accepted at ICL
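    The contrast this abstract studies, federated training started either from random weights or from a model pre-trained on server-side proxy data, can be sketched as a plain FedAvg loop whose only difference is the server's initial state. The following is a minimal sketch assuming PyTorch; the model, the synthetic client data, and the stand-in "pre-trained" weights are illustrative placeholders, not the paper's benchmarks or architectures.

    # Minimal FedAvg sketch contrasting random vs. pre-trained server initialization.
    # Model, data, and hyperparameters are illustrative, not the paper's setup.
    import copy
    import torch
    import torch.nn as nn

    def make_model(pretrained_state=None):
        model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
        if pretrained_state is not None:
            model.load_state_dict(pretrained_state)  # start from proxy-data pre-training
        return model                                 # otherwise: random initialization

    def client_update(global_model, data, targets, lr=0.01, local_steps=5):
        model = copy.deepcopy(global_model)          # each client trains a local copy
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(local_steps):
            opt.zero_grad()
            loss_fn(model(data), targets).backward()
            opt.step()
        return model.state_dict()

    def fedavg_round(global_model, client_batches):
        updates = [client_update(global_model, x, y) for x, y in client_batches]
        avg = {k: torch.stack([u[k] for u in updates]).mean(dim=0) for k in updates[0]}
        global_model.load_state_dict(avg)            # server averages client models
        return global_model

    # Synthetic clients with shifted inputs to mimic data heterogeneity.
    clients = [(torch.randn(16, 32) + i, torch.randint(0, 10, (16,))) for i in range(4)]

    random_init = make_model()                       # baseline: random start
    proxy_weights = make_model().state_dict()        # stand-in for weights pre-trained on proxy data
    pretrained_init = make_model(pretrained_state=proxy_weights)
    for model in (random_init, pretrained_init):
        for _ in range(3):                           # a few communication rounds
            model = fedavg_round(model, clients)

    In this sketch, the paper's recommendation amounts to running both branches of the outer loop and reporting results for both starting points.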

    PaCo: Probability-based Path Confidence Prediction

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / CCR-042971. Gigascale Systems Research Center.

    Effective Long-Context Scaling of Foundation Models

    Full text link
    We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis of the individual components of our method. We delve into Llama's position encodings and discuss their limitations in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long-context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
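    The limitation in Llama's position encodings that the abstract mentions is often addressed, when continuing pretraining at longer sequence lengths, by enlarging the base frequency of the rotary position embeddings so that rotation angles grow more slowly across a 32,768-token window. The sketch below shows plain RoPE with a configurable base, assuming PyTorch; the 500,000 base value and the tensor shapes are assumptions for illustration, not hyperparameters quoted from the paper.

    # Rotary position embeddings (RoPE) with a configurable base frequency.
    import torch

    def rope_angles(seq_len, head_dim, base=10_000.0):
        # one inverse frequency per pair of dimensions
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        positions = torch.arange(seq_len).float()
        return torch.outer(positions, inv_freq)      # shape: (seq_len, head_dim // 2)

    def apply_rope(x, base=10_000.0):
        # x: (seq_len, head_dim); rotate consecutive dimension pairs by position-dependent angles
        seq_len, head_dim = x.shape
        angles = rope_angles(seq_len, head_dim, base)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        rotated = torch.empty_like(x)
        rotated[:, 0::2] = x1 * cos - x2 * sin
        rotated[:, 1::2] = x1 * sin + x2 * cos
        return rotated

    q = torch.randn(32_768, 128)                     # query vectors for a long sequence
    q_default = apply_rope(q, base=10_000.0)         # Llama-2-style base
    q_stretched = apply_rope(q, base=500_000.0)      # larger base: slower rotation over long ranges

    The only change between the two calls is how quickly the angles grow with position, which is what makes this kind of adjustment compatible with continual pretraining from an existing checkpoint rather than pretraining from scratch.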

    Critical Branches and Lucky Loads in Control-Independence Architectures

    No full text
    148 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2009. I perform a thorough analysis of the performance sensitivity of CI processors to disambiguation and forwarding. The insights from this analysis are used to drive the design of hardware mechanisms for these two functions that are low in complexity and yet attain high performance. The basic premise behind these mechanisms is to use small caches to perform early disambiguation and forwarding. These caches are not responsible for ensuring correctness; they merely enable high performance in the presence of lucky loads. The caches are backed up by a simple load re-execution mechanism that guarantees correctness. I find that the performance of a CI processor with a small 32-entry disambiguation structure and a 128-entry forwarding structure is within 10% of that of global load and store queues in the worst case. U of I Only: restricted to the U of I community indefinitely during batch ingest of legacy ETDs.
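    The mechanism described above, small best-effort structures whose mistakes are tolerated because load re-execution guarantees correctness, can be illustrated with a toy software analogy. The sketch below, in Python, assumes a tiny LRU forwarding cache and a flat architectural memory; the sizes, the replacement policy, and the interfaces are assumptions chosen for brevity, not the thesis's actual hardware design.

    # Toy analogy: a small cache forwards recent store values to loads on a best-effort
    # basis; a re-execution check against architectural memory catches any mismatch,
    # so correctness never depends on the cache being right.
    from collections import OrderedDict

    class ForwardingCache:
        """Tiny LRU cache of recent store values; misses and wrong hits are tolerated."""
        def __init__(self, entries=32):
            self.entries = entries
            self.data = OrderedDict()

        def write(self, addr, value):
            self.data[addr] = value
            self.data.move_to_end(addr)
            if len(self.data) > self.entries:
                self.data.popitem(last=False)        # evict the least recently used entry

        def read(self, addr):
            return self.data.get(addr)               # None on a miss

    memory = {}                                      # architectural memory: always correct
    fwd = ForwardingCache(entries=32)

    def do_store(addr, value):
        memory[addr] = value
        fwd.write(addr, value)

    def do_load(addr):
        speculative = fwd.read(addr)                 # fast path: a "lucky" load may hit here
        correct = memory.get(addr, 0)                # re-execution path: ground truth
        if speculative is not None and speculative != correct:
            pass                                     # a real pipeline would flush and replay here
        return correct                               # the committed value comes from re-execution

    do_store(0x100, 42)
    assert do_load(0x100) == 42

    Because the returned value always comes from the re-execution path, shrinking the cache only costs performance, never correctness, mirroring the separation the abstract describes.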